Exploring the diagnostic performance of machine learning in prediction of metabolic phenotypes focusing on thyroid function

In this study, we employed various machine learning models to predict metabolic phenotypes, focusing on thyroid function, using a dataset from the National Health and Nutrition Examination Survey (NHANES) from 2007 to 2012. Our analysis utilized laboratory parameters relevant to thyroid function or metabolic dysregulation in addition to demographic features, aiming to uncover potential associations between thyroid function and metabolic phenotypes by various machine learning methods. Multinomial Logistic Regression performed best to identify the relationship between thyroid function and metabolic phenotypes, achieving an area under receiver operating characteristic curve (AUROC) of 0.818, followed closely by Neural Network (AUROC: 0.814). Following the above, the performance of Random Forest, Boosted Trees, and K Nearest Neighbors was inferior to the first two methods (AUROC 0.811, 0.811, and 0.786, respectively). In Random Forest, homeostatic model assessment for insulin resistance, serum uric acid, serum albumin, gamma glutamyl transferase, and triiodothyronine/thyroxine ratio were positioned in the upper ranks of variable importance. These results highlight the potential of machine learning in understanding complex relationships in health data. However, it’s important to note that model performance may vary depending on data characteristics and specific requirements. Furthermore, we emphasize the significance of accounting for sampling weights in complex survey data analysis and the potential benefits of incorporating additional variables to enhance model accuracy and insights. Future research can explore advanced methodologies combining machine learning, sample weights, and expanded variable sets to further advance survey data analysis.


Introduction
Obesity is rapidly increasing worldwide, imposing a significant burden on health systems [1] and is associated with various adverse health outcomes especially in terms of dysregulated metabolism [2].Traditionally and practically, obesity is defined by a single dimension, body mass index (BMI).However, considering heterogeneous clinical presentations of obesity, such as the existence of metabolically healthy individuals within the obese population, a more sophisticated way of obesity classification is needed for understanding of the mechanistical association between obesity and adverse health outcomes as well as for proper management of obese population.Based on both BMI and metabolic presentations, metabolic phenotypes are classified as metabolic healthy normal weight (MHNW), metabolic healthy obesity (MHO), metabolic unhealthy normal weight (MUNW), and metabolic unhealthy obesity (MUO).Individuals with metabolically healthy obesity are known to exhibit various favorable characteristics in terms of metabolism [3], despite being obese, and have a relatively lower cardiovascular risk compared to the metabolically unhealthy population [4].However, the clinical significance of MHO, including the existence of specific management options, and the underlying mechanistic distinctions between MUO and metabolically unhealthy normal weight (MUNW) are not fully comprehended.
As thyroid hormone is a main regulator of energy metabolism [5], altered glucose [6,7] and lipid [8,9] metabolism are observed in overt thyroid diseases.Even in subtle changes of thyroid function such as subclinical hypothyroidism and hyperthyroidism, those metabolic changes are found [10,11].Furthermore, variations of thyroid function within normal reference ranges were suggested to be related to metabolic derangement [12][13][14][15] and blood pressure [16].In this context, several studies evaluated and found the association of thyroid function within reference ranges and metabolic phenotypes [17][18][19][20].However, there is currently no consistent consensus, which may be attributed to variations in the variables used, different definitions of metabolic syndrome, and variations in the populations studied.
Machine learning (ML), a form of artificial intelligence, is an automated learning process that allows programs to solve problems by analyzing data, thus facilitating data comparison and classification [21].By learning from and acting on data, it enables to generate an algorithm to predict a certain outcome.The accumulation of diverse and extensive healthcare data presents favorable conditions for the development of diagnostic and prognostic prediction models using machine learning techniques.Various ML methods have been applied in healthcare settings, particularly in the predictive analysis of conditions such as high blood pressure and diabetes [22].Recently, Analyses have also been attempted to predict various diseases beyond hypertension and diabetes through the language model analysis of symptoms [23], further exemplifying how the domain of machine learning has expanded to encompass not only the analysis of skin lesions using imaging data [24] but also the prediction and management of physiological conditions [25], demonstrating a broadening scope of applications in healthcare.In the field of thyroid research, several studies have utilized machine learning techniques for investigation.In the early stages, there was study utilizing neural networks to predict thyroid function status based on laboratory and clinical data [26].More recently, numerous research studies have been conducted using various machine learning techniques to predict the diagnosis of thyroid nodules [27].Recent study explored the factors influencing TSH levels using commonly collected demographic information and laboratory parameters [28].
Given the lack of a consistent correlation between metabolic phenotypes and thyroid function, and with the underlying mechanisms still undisclosed, there is an opportunity to leverage artificial intelligence (AI) for the development of a predictive model for metabolic phenotypes.To date, limited research has explored the ability of machine learning methods to accurately predict metabolic phenotypes in conjunction with thyroid function.Thus, the aim of this study is to develop and evaluate the performance of machine learning methods for the prediction of metabolic phenotypes, with a special focus on thyroid function.Additional factors commonly associated with both thyroid function and metabolic phenotypes will be incorporated into the machine learning methods to create prediction models.Such an approach has the potential to shed light on the complex relationship between these variables and advance our understanding in this domain.
Hyeong Jun Ahn, as the first author, spearheads the study's conceptualization, protocol development, and manuscript drafting, collaborating with co-authors to refine the submission.Additionally, Minhee Kim provides valuable clinical expertise through an exhaustive literature review, synthesizing existing knowledge to inform the study's background and rationale.Furthermore, Kyle Ishikawa plays a pivotal role in the research endeavor, spearheading all aspects of data management, organization, and analysis, ensuring the integrity and accuracy of the study's findings.

Methods
Metabolic phenotypes, MHNW, MHO, MUNW, and MUO, were classified according to harmonized definition [29].A cutoff value of BMI 30 was applied for definition of obesity, and unhealthy phenotypes were defined as having any of the following components: (1) fasting glucose �100 mg/dl (or taking medication/insulin for diabetes), ( 2) systolic blood pressure �130 mmHg or diastolic blood pressure � 85 mmHg (or taking medication for hypertension), ( 3) triglyceride (TG) �150mg/dl, or (4) high density lipoprotein cholesterol (HDL-c) < 40 mg/dl in male, <50 mg/dl in female.We included demographic (gender, age, race, and smoking status) and thyroid laboratory test including free thyroxine, free triiodothyronine (FT3), total thyroxine (TT4), total triiodothyronine (TT3), thyrotropin (TSH), free T3/T4 ratio, total T3/T4, urine iodine creatinine ratio (UICR) and thyroid autoantibodies such as anti-thyroid peroxidase (TPO) antibody and anti-thyroglobulin (Tg) for the analysis.Laboratory variables that have been reported to be associated with metabolic dysregulation were also selected (ex.albumin, gamma glutamyl transferase, total bilirubin, uric acid, homeostatic model assessment for insulin resistance (HOMA-IR)) [30][31][32][33].Additionally, we used other laboratory variables available in the dataset for ML prediction to improve the performance of ML, even though there have been few reports on their association with thyroid function or obesity.
For the continuous variables, we summarized the data using the median and interquartile range (IQR) (FT4 quartile (Q), FT3 Q, TT4 Q, TT3 Q, TSH Q, UICR Q, HOMA-IR Q).For the categorical variables, we presented the results as percentages and counts.We performed statistical tests, such as Fisher's exact test and Kruskal-Wallis rank sum test, to assess the significance of associations between the characteristics and metabolic phenotypes.The p-values indicated the statistical significance of these associations.
We employed supervised machine learning models, including random forest, boosted trees, logistic regression, lasso regression, K nearest neighbors, and a type of neural network called a multilayer perceptron to classify the metabolic phenotype.The dataset was divided into a 80% training set and a 20% testing set for model evaluation.
Preprocessing was applied to the data based on what was required or could improve model performance.Table 1 describes the preprocessing sets in the order that they were applied for each model.Only models or parameters calculated using the training set were applied to the testing set to avoid data leakage.For example, when imputing missing data in the testing set, the bagged tree models were created only using data from the training set.First, variables with zero variance were removed from the dataset since they do not contribute valuable information.Missing values were then imputed using bagged tree models, utilizing ensemble learning to estimate those values based on dataset relationships.Continuous variables with a correlation greater than 0.8 with another variable were removed to eliminate redundancy and mitigate multicollinearity.In the multinomial logistic regression, lasso regression, and neural network models, continuous variable skewness was reduced through the application of the Yeo-Johnson transformation, promoting a more symmetric distribution.Additionally, continuous variables in the lasso regression, K nearest neighbors, random forest, boosted trees, and neural network models were normalized to have a mean of 0 and a standard deviation of 1, ensuring balanced scales.To handle missing or unknown categorical values, a novel level was created for categorical variables in all models, preventing information loss.Finally, categorical variables were converted into dummy variables, generating binary indicators for each category, enhancing the models' handling of categorical data.By applying these preprocessing steps, we aimed to enhance the quality and suitability of the dataset for each respective machine learning model.An overview of the general ML steps is presented in Fig 1.
Hyperparameter tuning was conducted for each machine learning model, except for Linear Regression, using grid search and simulated annealing (SA).This technique randomly crawls the hyperparameter space to find the optimal combination.Simulated annealing will backtrack or restart its search when suboptimal combinations are found.To prevent overfitting, we employed a 5-fold cross-validation strategy, where the training set was divided into 5 smaller sets (k = 5).The model was trained on k-1 folds and validated on the remaining portion of the dataset.This process was repeated three times with different fold splits to eliminate bias from the initial split of the folds.The combination of hyperparameters that were selected were based on their performance in 15 total models (five folds times three repeats).
Subsequently, the tuned models were tested on the testing dataset, and the accuracy of the different models were assessed using metrics such as the Hand-Till area under the receiver operator curve (AUROC), accuracy, precision, recall, and F1-score.Confusion matrices were plotted to show predicted values compared to true values.Lastly, the top 20 important variables from the random forest model were plotted based on purity-the degree to which a node can separate the classes.
Table 2 presents the key parameters utilized for each classification model, outlining the specific settings employed for model training and evaluation.

Lasso regression
Lasso regression performs feature selection by eliminating predictors that are redundant or do not decrease the prediction error.Complex relationships were created by adding every twoand three-way combination of interaction terms resulting in 19,600 new predictors.The tuned

K nearest neighbors
K nearest neighbors calculates distances from one observation to its surrounding observations, or neighbors.This model is good at predicting an outcome by examining neighbors that are the most similar to the input observation.The value of K was set to 101, indicating that the algorithm considers the 101 nearest neighbors when making predictions.

Random forest
Random forest reduces the bias from individual decision trees that uses all of the training data.A RF is an ensemble of decision trees whose nodes are a random selection of predictors.This reduces overfitting because the trees do not have access to all of the training data.Individually, they are weak models, but if many of them collected and vote on the outcome, they create a strong model.The number of trees in the forest was set to 490.Additionally, the algorithm uses 7 predictors for splitting nodes and requires a minimum node size of 45 observations.

Boosted trees
Boosted trees is an extension of RF that sequentially improves on the error from the previous tree, hence, "boosting" the weak tree.In the end, the sequence of improved trees is combined to create an ensemble.Our implementation included 1,695 trees and each tree used 9 predictors for splitting nodes.The minimum node size was set to 33, and the tree depth was limited to 7. A learning rate of 0.00405 was used to control the contribution of each tree.

Neural network
The neural network consisted of a single hidden layer with 7 units.This configuration is known as a multilayer perceptron and its hidden layer between its input and output layer is useful for modeling complex non-linear relationships.A visual of this neural network architecture has been presented in Fig 2 .To prevent overfitting, an amount of regularization equal to 7.3 x 10^(-9) was applied.The training process spanned 404 epochs, and a learn rate of 0.296 was used.The activation function employed in the hidden layer was the rectified linear unit (ReLu).
Regarding demographic variables, a significant difference was observed for gender (p < 0.001), with a higher proportion of males in the MHNW and MHO groups compared to the MUNW, MUO, and unknown groups.Age (RIDAGEYR) also showed a significant difference among the groups (p < 0.001), with the MHNW group having a younger median age compared to the other groups.Additionally, smoking status (smok_stat2) demonstrated a significant difference (p < 0.001) among the groups.The majority of individuals in all groups were non-smokers, but a higher proportion of current smokers was observed in the MUNW and MUO groups compared to the other groups.
No significant differences were found for TPO_stat (p = 0.928) and antiTg_stat (p = 0.880), indicating similar distributions among the groups.Ethnicity (RIDRETH2) displayed a significant difference (p < 0.001), with variations in the proportions of different racial and ethnic groups across the MHNW, MHO, MUNW, MUO, and unknown groups.
Furthermore, several laboratory measurements showed significant differences among the groups.These included serum albumin (LBXSAL, p < 0.001), serum globulin (LBXSGTSI, p < 0.001), total bilirubin (LBXSTB, p < 0.001), serum uric acid (LBXSUA, p < 0.001), and serum globulin (LBXSGB, p < 0.001).The FT4_T3_Ratio (p = 0.227) and TT4_T3_Ratio (p < 0.001) did not show significant differences among the groups.We employed multinomial logistic regression to analyze the associations between the metabolic phenotypes and the independent variables presented in the demographic table.Additionally, we utilized machine learning approaches, including neural network, random forest, boosted trees, lasso regression, and K nearest neighbors models, to explore and validate the relationships between the variables.The inclusion of these machine learning techniques allowed for a comprehensive evaluation of the predictive performance and potential insights gained from alternative modeling approaches.We evaluated the performance of each model based on metrics such as Hand-Till AUROC, accuracy, precision, recall, and F1 score (Table 4).
A confusion matrix that shows the distribution of test predictions against their true values is presented in Fig 3 .A gradient is overlayed onto the numeric values to produce a heatmap.It is clear that there is a class imbalance with MUNW being the largest class and MHO being the smallest class.The only model to predict the minority class is the multinomial logistic regression.

Multinomial logistic regression
The multinomial logistic regression model achieved the best performance with an AUROC of 0.818, indicating its ability to discriminate between classes.The model exhibited an accuracy of 0.655, precision of 0.559, recall of 0.502, and an F1 score of 0.510.These results suggest that the model performed moderately well in predicting the target variable.

Neural network
The neural network model yielded an AUROC of 0.814, demonstrating its discriminative power.The model achieved an accuracy of 0.655, precision of 0.632, recall of 0.487, and an F1 score of 0.640.These metrics suggest that the neural network performed similarly to the multinomial logistic regression model, showing competitive performance in predicting the target variable.

Random forest
The random forest model produced an AUROC of 0.811, indicating its ability to discriminate between classes.The model exhibited an accuracy of 0.641, precision of 0.647, recall of 0.439, and an F1 score of 0.599.These results suggest that the Random Forest model achieved reasonable predictive performance, albeit with a lower recall compared to the previous models.The most important predictors based on purity was HOMAQ_Q4 with an importance score of 88.2, followed closely by RIDAGEYR with an importance score of 84.8 (Fig 1).Other important predictors included LBXSUA (56.4), LBXSAL (49.7), and LBXSGTSI (48.5).Additionally, TT4_T3_Ratio (36.7) and FT4_T3_Ratio (32.3) were also notable contributors to the model.Among the demographic variables, RIAGENDR_Female (10.3) showed some influence, as did smok_stat2_Current.Smoker (7.5) and RIDRETH2_NH.Black (6.0).

Boosted trees
The boosted trees model also achieved an AUROC of 0.811, demonstrating its discriminative power.The model yielded an accuracy of 0.648, precision of 0.621, recall of 0.458, and an F1 score of 0.612.These results indicate that the Boosted Trees model performed similarly to the Random Forest model, with comparable predictive performance.

K nearest neighbors
The K nearest neighbors model obtained an AUROC of 0.786, indicating its discriminative ability.The model exhibited an accuracy of 0.604, precision of 0.621, recall of 0.379, and an F1 score of 0.512.These results suggest that the K nearest neighbors model achieved moderate predictive performance, but with a lower recall compared to the other models.

Lasso regression
The lasso regression model had the lowest AUROC with a value of 0.776.Although it was presented many interaction effects, most of the predictors were eliminated during regularization in the tuning stage.The model achieved an accuracy of 0.633, precision of 0.613, recall of 0.435, and an F1 score of 0.587.Despite its low AUROC curve, the model's F1 score shows a better balance of precision and recall than the K nearest neighbors, random forest, and multinomial logistic regression models.In summary, the multinomial logistic regression, neural network, random forest, boosted trees, and K nearest neighbors models were evaluated in terms of their performance metrics.The neural network and multinomial logistic regression models performed relatively well, achieving similar AUROC scores and demonstrating competitive accuracy, precision, recall, and F1 scores.The random forest and boosted trees models exhibited slightly lower AUROC scores but maintained reasonable predictive performance.Finally, the K nearest neighbors model achieved moderate performance, with lower recall compared to the other models.

Discussion
Our study reveals distinctive characteristics in the predictive performance of each machine learning model for metabolic phenotypes, as evident from their performance metrics.Both the Multinomial Logistic Regression and Neural Network models demonstrate moderate overall performance (AUROC of 0.818 and 0.814, respectively), with the former tending towards a balanced precision and recall, while the latter tends to prioritize precision over recall.In contrast, both the Random Forest and Boosted Trees methods exhibit competitive AUROC scores, underscoring their ability to capture underlying data patterns effectively.However, while the Random Forest model demonstrates higher precision, Boosted Trees display a more balanced precision-recall trade-off.Conversely, the K Nearest Neighbors model struggles with lower recall, indicating limitations in effectively identifying positive cases.Lastly, the Lasso Regression model, while maintaining decent precision, falls short in recall, suggesting potential challenges in capturing all instances of positive metabolic phenotypes.
In this study, parameters possibly related to both thyroid function and obesity were selected for ML variables.It is important to note that the performance of machine learning models can vary depending on the specific dataset, the quality and quantity of data, the chosen evaluation metrics, and other factors.Therefore, while Multinomial Logistic Regression performed best in this study, it does not imply that it will always be the top-performing model in every scenario.It is advisable to assess the suitability of different models based on the specific requirements and characteristics of the dataset at hand.Machine learning performs better in scenarios where large and complex datasets with numerous variables and intricate patterns need to be analyzed, as they excel at extracting valuable insights and making accurate predictions.They are well-suited for capturing non-linear relationships between variables, handling high-dimensional data by automatically learning relevant features or performing feature selection.Machine learning techniques are also capable of processing unstructured data such as text, images, audio, and video, enabling meaningful information extraction and classification.They can be utilized for real-time decision making, large-scale data processing, and prediction/classification tasks, demonstrating their versatility in various domains.However, the performance of machine learning algorithms depends on factors like data quality, feature engineering, model selection, hyperparameter tuning, and the expertise of the practitioner, necessitating careful assessment of specific requirements to determine the suitability of machine learning approaches [34].
When working with survey data such as NHANES, it is crucial to account for the complex survey design and ensure proper representation of the population.NHANES provides sample weights that allow researchers to estimate population-level statistics accurately.These weights adjust for sampling probabilities, non-response, and post-stratification factors.Failure to incorporate sample weights can lead to biased estimates and incorrect inferences.MacNell et al. [35] conducted a study examining the implications of incorporating sampling weights in gradient boosting models.Their findings emphasized the importance of accounting for sampling weights when analyzing data from complex surveys.The authors concluded that neglecting to consider sampling weights in gradient boosting models can potentially compromise the generalizability of the results, particularly depending on the sample size and other analytical characteristics.In cases where weighted algorithms are not readily available, the study suggests performing post-hoc re-calculations of unweighted model performance using weighted observed outcomes as an alternative approach.This methodology may provide a more accurate reflection of the model's predictions in target populations than completely disregarding the weights.The study by MacNell et al. underscores the significance of appropriately handling sampling weights to ensure valid and reliable analyses in complex survey settings.
Expanding the set of variables in the analysis can enhance the understanding and predictive power of the models.NHANES provides a rich array of health-related variables that capture various aspects of individuals' health status, behaviors, and demographic characteristics.By incorporating additional variables, researchers can potentially uncover important relationships, identify new risk factors, or gain insights into complex interactions.However, it is crucial to consider the relevance and validity of the additional variables and ensure they align with the research objectives and hypotheses for machine learning approaches.Careful variable selection and feature engineering techniques may be necessary to prevent issues such as multicollinearity and overfitting.Adding more variables can improve the models' accuracy and robustness, but it requires thoughtful consideration and domain expertise.Another limitation of our study is that although using NHANES data from different cycles for validation could potentially enhance the analytical value, a significant number of parameters used in our research were not included in data from other cycles.This discrepancy limited our ability to conduct a comprehensive validation.Furthermore, it's essential to acknowledge that while our findings are based on NHANES data, their generalizability to other datasets may be limited.Therefore, additional research is warranted to explore these relationships in different populations and settings, which could inform future investigations in this field.Lastly, there is a large class imbalance in the outcome variable, especially between MUNW (the majority class) and MHO (the minority class).This results in a scarcity of MHO predictions from the models.To counteract this, future work could incorporate the synthetic minority over-sampling technique (SMOTE) [36] to make sure that all outcome classes are represented in the predictions.SMOTE creates observations from the feature space of the minority class so that the synthetic data is within the realm of what is reasonable but not redundant.
Future studies should also explore the integration of machine learning techniques to address the challenges associated with survey sample weights and the incorporation of additional variables.Machine learning algorithms have the potential to handle complex survey data and leverage the power of sample weights effectively.Researchers can investigate novel approaches that specifically account for sample weights within machine learning models, ensuring accurate and representative analysis of survey data.A follow-up study could create Neural nets with more complex architecture than the one used in this study (multilayer perceptron) to model sophisticated relationships between predictors.Additionally, exploring the impact of incorporating more variables in machine learning models can uncover hidden patterns and relationships, enhancing prediction accuracy and providing a deeper understanding of the factors influencing the outcomes of interest.Future research in this domain can pave the way for innovative methodologies that combine machine learning, sample weights, and expanded variable sets, ultimately advancing the field of survey data analysis and enabling more robust and reliable insights.